Import packages
library(pastecs)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
## Warning: package 'psych' was built under R version 3.5.2
library(purrr)
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning in doTryCatch(return(expr), name, parentenv, handler): unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
## dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
## Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/modules/R_X11.so
## Reason: image not found
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 3.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(igraph)
## Warning: package 'igraph' was built under R version 3.5.2
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(MatchIt)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(tableone)
## Warning: package 'tableone' was built under R version 3.5.2
Import data
df <- read.csv('HighNote Data Midterm.csv')
head(df)
## ID age male friend_cnt avg_friend_age avg_friend_male friend_country_cnt
## 1 1 22 0 8 22.57143 0.4285714 1
## 2 2 35 0 2 28.00000 1.0000000 2
## 3 3 27 1 2 23.00000 1.0000000 1
## 4 4 21 0 28 22.94737 0.5000000 7
## 5 5 24 0 65 22.28302 0.9137931 9
## 6 6 21 1 12 25.00000 0.7777778 1
## subscriber_friend_cnt songsListened lovedTracks posts playlists shouts
## 1 0 9687 194 0 1 8
## 2 0 0 0 0 0 0
## 3 0 508 0 0 1 2
## 4 1 1357 32 0 0 1
## 5 0 89984 20 2 0 81
## 6 0 124547 10 0 1 2
## adopter tenure good_country
## 1 0 59 1
## 2 0 35 0
## 3 0 42 0
## 4 0 25 0
## 5 0 67 0
## 6 0 53 1
#1. Summary statistics
describeBy(df, df$adopter)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed
## ID 1 40300 20150.50 11633.75 20150.50 20150.50
## age 2 40300 23.95 6.37 23.00 23.09
## male 3 40300 0.62 0.48 1.00 0.65
## friend_cnt 4 40300 18.49 57.48 7.00 10.28
## avg_friend_age 5 40300 24.01 5.10 23.00 23.40
## avg_friend_male 6 40300 0.62 0.32 0.67 0.65
## friend_country_cnt 7 40300 3.96 5.76 2.00 2.66
## subscriber_friend_cnt 8 40300 0.42 2.42 0.00 0.13
## songsListened 9 40300 17589.44 28416.02 7440.00 11817.64
## lovedTracks 10 40300 86.82 263.58 14.00 36.35
## posts 11 40300 5.29 104.31 0.00 0.23
## playlists 12 40300 0.55 1.07 0.00 0.45
## shouts 13 40300 29.97 150.69 4.00 8.84
## adopter 14 40300 0.00 0.00 0.00 0.00
## tenure 15 40300 43.81 19.79 44.00 43.72
## good_country 16 40300 0.36 0.48 0.00 0.32
## mad min max range skew kurtosis se
## ID 14937.19 1 40300 40299 0.00 -1.20 57.95
## age 4.45 8 79 71 1.97 6.80 0.03
## male 0.00 0 1 1 -0.50 -1.75 0.00
## friend_cnt 7.41 1 4957 4956 32.67 2087.42 0.29
## avg_friend_age 3.95 8 77 69 1.84 7.15 0.03
## avg_friend_male 0.35 0 1 1 -0.52 -0.72 0.00
## friend_country_cnt 1.48 0 129 129 4.74 38.29 0.03
## subscriber_friend_cnt 0.00 0 309 309 72.19 8024.62 0.01
## songsListened 10576.87 0 1000000 1000000 6.05 105.85 141.55
## lovedTracks 20.76 0 12522 12522 13.12 335.93 1.31
## posts 0.00 0 12309 12309 73.92 7005.34 0.52
## playlists 0.00 0 98 98 28.21 1945.28 0.01
## shouts 4.45 0 7736 7736 22.53 779.12 0.75
## adopter 0.00 0 0 0 NaN NaN 0.00
## tenure 22.24 1 111 110 0.05 -0.70 0.10
## good_country 0.00 0 1 1 0.59 -1.65 0.00
## --------------------------------------------------------
## group: 1
## vars n mean sd median trimmed
## ID 1 3527 42064.00 1018.30 42064.00 42064.00
## age 2 3527 25.98 6.84 24.00 25.05
## male 3 3527 0.73 0.44 1.00 0.79
## friend_cnt 4 3527 39.73 117.27 16.00 23.69
## avg_friend_age 5 3527 25.44 5.21 24.36 24.83
## avg_friend_male 6 3527 0.64 0.25 0.67 0.65
## friend_country_cnt 7 3527 7.19 8.86 4.00 5.36
## subscriber_friend_cnt 8 3527 1.64 5.85 0.00 0.84
## songsListened 9 3527 33758.04 43592.73 20908.00 25811.69
## lovedTracks 10 3527 264.34 491.43 108.00 161.68
## posts 11 3527 21.20 221.99 0.00 1.44
## playlists 12 3527 0.90 2.56 1.00 0.59
## shouts 13 3527 99.44 1156.07 9.00 23.89
## adopter 14 3527 1.00 0.00 1.00 1.00
## tenure 15 3527 45.58 20.04 46.00 45.60
## good_country 16 3527 0.29 0.45 0.00 0.23
## mad min max range skew kurtosis se
## ID 1307.65 40301 43827 3526 0.00 -1.20 17.15
## age 4.45 8 73 65 1.68 4.39 0.12
## male 0.00 0 1 1 -1.03 -0.94 0.01
## friend_cnt 17.79 1 5089 5088 26.04 1013.79 1.97
## avg_friend_age 3.91 12 62 50 1.68 5.05 0.09
## avg_friend_male 0.25 0 1 1 -0.54 -0.05 0.00
## friend_country_cnt 4.45 0 136 136 3.61 24.53 0.15
## subscriber_friend_cnt 0.00 0 287 287 34.05 1609.52 0.10
## songsListened 23276.82 0 817290 817290 4.71 46.64 734.03
## lovedTracks 140.85 0 10220 10220 6.52 80.96 8.27
## posts 0.00 0 8506 8506 26.52 852.38 3.74
## playlists 1.48 0 118 118 28.84 1244.31 0.04
## shouts 11.86 0 65872 65872 52.52 2969.09 19.47
## adopter 0.00 1 1 0 NaN NaN 0.00
## tenure 20.76 0 111 111 0.02 -0.62 0.34
## good_country 0.00 0 1 1 0.94 -1.12 0.01
free <- df[df$adopter == 0,]
premium <- df[df$adopter == 1,]
#stat.desc(free)
#stat.desc(premium)
#t-test
output = NULL
for (i in 2:16){
if(i == 14) next()
name = names(free[i])
p_value = t.test(free[i], premium[i])$p.value
output <- rbind(output, data.frame(name, p_value))
}
print(output)
## name p_value
## 1 age 1.207042e-62
## 2 male 1.388253e-41
## 3 friend_cnt 4.360547e-26
## 4 avg_friend_age 9.940507e-54
## 5 avg_friend_male 9.096630e-06
## 6 friend_country_cnt 6.523070e-95
## 7 subscriber_friend_cnt 4.994062e-34
## 8 songsListened 6.334710e-98
## 9 lovedTracks 3.839467e-94
## 10 posts 2.556983e-05
## 11 playlists 8.619207e-16
## 12 shouts 3.673566e-04
## 13 tenure 4.767641e-07
## 14 good_country 1.941292e-18
From the above descriptive statistics, there are 40,300 free users (group 0/adopter = 0) and 3527 premium users (group 1/adopter = 1). For premium users, almost all its variables have higher mean values than that of free users. For instance, the average age of premium users is 25.98, older than that of free users, 23.95. It can be inferred that younger users would like to choose free service on High Note platform due to their income or other relevant reasons. In terms of varaibles of avg_friend_cnt and singListened, for instance, the premium users have more average friend count than free users. Also, they listened more songs (songListened variable), which is 33,758 on average, than 17,589 songs listened by free users. In a nutshell, the overall performance of all variables for premium users is better than free users. It can be assumed that premium users, as opposed to free users, are more likely to be loyal on the High Note music platform.
Doing t-test on all variables for adopters (premium users) and non-adopters (free users), p-values of all variables are statistically significant. It means that there are huge difference between two groups, in terms of all variables.
#2. Data Visualization (i) demographics
#age
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
df$Adopter <- factor(df$adopter, levels = c(0, 1))
means <- ddply(df, "Adopter", summarise, age.mean = mean(age))
means
## Adopter age.mean
## 1 0 23.94844
## 2 1 25.97987
ggplot(df, aes(x = age, group = Adopter, fill = Adopter)) +
geom_density(alpha = .3) +
geom_vline(data = means, aes(xintercept = age.mean, colour = Adopter),
linetype = "longdash", size=1)
#male
df$Male <- factor(df$male, levels = c(0, 1))
df %>% group_by(Adopter, Male) %>% tally() %>%
ggplot(aes(Male, n, group = Adopter, fill = Adopter)) + geom_bar(position = "dodge", stat = "identity") + labs(x = 'male', y = 'count', title = 'Male Count by Adopter')
#good country
df$Good_country <- factor(df$good_country, levels = c(0, 1))
df %>% group_by(Adopter, Good_country) %>% tally() %>%
ggplot(aes(Good_country, n, group = as.factor(Adopter), fill = Adopter)) + geom_bar(position = "dodge", stat = "identity") + labs(x = 'good_country', y = 'count', title = 'Count of Good Country by Adopter')
ggplot(df, aes(x=Adopter, y = Good_country))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
## No summary function supplied, defaulting to `mean_se()`
After descriptive statistics, I checked the difference of all variables through visualized plots. First, I looked to the visualization of demographic variables, including age, male, and good country. The age for two types of users are all right skewed, yet the free users are younger than the premium users, in general. As for the number of males, the premium users have more male than free users, while for the users from good country, it seems that free users come from good country more than premium users do.
#mean value of peer influence variables
col <- c(colnames(df)[4:8], 'adopter')
peer_mean<- df %>%
group_by(adopter) %>%
select(one_of(col)) %>%
summarise_all(funs(mean))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.)
##
## # After:
## list(name = ~f(.))
## This warning is displayed once per session.
peer_mean
## # A tibble: 2 x 6
## adopter friend_cnt avg_friend_age avg_friend_male friend_country_…
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 18.5 24.0 0.617 3.96
## 2 1 39.7 25.4 0.637 7.19
## # … with 1 more variable: subscriber_friend_cnt <dbl>
#distribution
ggpairs(df[4:8])
#plot mean value to see difference in each variable
p1<- ggplot(df, aes(x=Adopter, y = friend_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p2<- ggplot(df, aes(x=Adopter, y = avg_friend_age, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p3<-ggplot(df, aes(x=Adopter, y = avg_friend_male, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p4<-ggplot(df, aes(x=Adopter, y = friend_country_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p5<-ggplot(df, aes(x=Adopter, y = subscriber_friend_cnt, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
grid.arrange(p1, p2,p3,p4,p5, nrow = 3)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
In terms of variables related to peer influence, it includes friend_cnt, avg_friend_age, avg_friend_male, friend_country_cnt, subscriber_friend_cnt. Below is the barplot and distribution of mean values of all variables. They all have skewed distributions, based on the distribution matrix. In overall, the premium users have higher mean values of peer influence than free users have. That is to say premium users are more likely to be affected by their friends and also interact more with their friends as well.
df %>%
group_by(adopter) %>%
select(one_of(colnames(df)[9:15])) %>%
summarise_all(funs(mean))
## # A tibble: 2 x 7
## adopter songsListened lovedTracks posts playlists shouts tenure
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 17589. 86.8 5.29 0.549 30.0 43.8
## 2 1 33758. 264. 21.2 0.901 99.4 45.6
#distribution
ggpairs(df[9:13])
#plot mean value to see difference in each variable
p1<- ggplot(df, aes(x=Adopter, y = songsListened, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p2<- ggplot(df, aes(x=Adopter, y = lovedTracks, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p3<-ggplot(df, aes(x=Adopter, y = posts, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p4<-ggplot(df, aes(x=Adopter, y = playlists, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p5<-ggplot(df, aes(x=Adopter, y = shouts, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
p6<-ggplot(df, aes(x=Adopter, y = tenure, fill = Adopter))+ geom_bar(stat = 'summary', fun.y = 'mean')+xlab("adopter")
## Warning: Ignoring unknown parameters: fun.y
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 3)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
For the user engagement variables, it includes songListened, lovedTracks, posts, playlists, shouts, and tenure. T premium users have higher user engagement than free users, in terms of all mean values of these variables. That is to say, the premium users are more active on the High Note music platform.
#3. Propensity Score Matching (PSM) Use PSM to test whether having subscriber friends affects the likelihood of becoming an adopter (i.e. fee customer).
#split treatment and control group
df <- read.csv('HighNote Data Midterm.csv')
df <- mutate(df, treatment = ifelse(subscriber_friend_cnt>=1, 1, 0))
#run t-test before PSM
with(df, t.test(adopter ~ treatment)) #c-t: 0.052 - 0.178
##
## Welch Two Sample t-test
##
## data: adopter by treatment
## t = -30.961, df = 11815, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1330281 -0.1171869
## sample estimates:
## mean in group 0 mean in group 1
## 0.05243501 0.17754250
Before doing propensity score matching, first, I split treatment group and control group based on the definition of whether users that have one or more subscriber friends. Then, I did t-test to check if treatment group and control group have significant difference on the behavior of choosing free and premium service (adopter). It ended up getting the p-value with statistically significant. So, two groups are different between adopter variable. The mean value of adopter for treatment group is 0.178 and that of control group is 0.052 or so. Therefore, it is necessary to do propensity score matching to make two groups more balance.
#logistic regression for PSM To make sure all the other variables are significant to control group and treatment group, I did logistic regression for all variables, excluding variables related to treatment and adopter. According to the distribution plots in the second session, we knew the distributions of all continuous variables are skewed. Thus, I took log transformation before doing logistic regression. The result shows that, if the alpha of model is 0.05, then all variables are significantly correlated to the dependent variable - the treatment. AIC is 31493.
logit<- glm(treatment ~log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1) + log(avg_friend_male+1) +
log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +
log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, family = binomial(), data = df)
summary(logit)
##
## Call:
## glm(formula = treatment ~ log(age) + male + log(friend_cnt +
## 1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) +
## log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks +
## 1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) +
## log(tenure + 1) + good_country, family = binomial(), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5576 -0.5682 -0.2997 -0.1170 3.3903
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.758481 0.301395 -58.921 < 2e-16 ***
## log(age) 0.647953 0.095402 6.792 1.11e-11 ***
## male 0.078393 0.031386 2.498 0.0125 *
## log(friend_cnt + 1) 1.078788 0.027146 39.740 < 2e-16 ***
## log(avg_friend_age + 1) 3.486321 0.125283 27.828 < 2e-16 ***
## log(avg_friend_male + 1) 0.381212 0.090651 4.205 2.61e-05 ***
## log(friend_country_cnt + 1) 0.560058 0.032010 17.496 < 2e-16 ***
## log(songsListened + 1) 0.051766 0.009154 5.655 1.56e-08 ***
## log(lovedTracks + 1) 0.084539 0.007936 10.652 < 2e-16 ***
## log(posts + 1) 0.079169 0.014703 5.385 7.26e-08 ***
## log(playlists + 1) -0.152139 0.035585 -4.275 1.91e-05 ***
## log(shouts + 1) -0.028970 0.013678 -2.118 0.0342 *
## log(tenure + 1) -0.367109 0.031064 -11.818 < 2e-16 ***
## good_country 0.059487 0.030370 1.959 0.0501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 46640 on 43826 degrees of freedom
## Residual deviance: 31465 on 43813 degrees of freedom
## AIC: 31493
##
## Number of Fisher Scoring iterations: 6
After logistic regression, we use the model to predict propensity scores of treatment and control group.
#predicted propensity score
prs_df <- data.frame(pr_score = predict(logit, type = "response"),
treatment = logit$model$treatment)
head(prs_df)
## pr_score treatment
## 1 0.07337272 0
## 2 0.04897846 0
## 3 0.02140195 0
## 4 0.42334070 1
## 5 0.64329312 0
## 6 0.15366186 0
I plotted the propensity score by two groups, which refers to the following distribution plots. It shows that the treatment group has consistent number of users among different levels of propensity scores. Yet the control group has right skewed distribution. It has more people with lower propensity score, who are less likely to turning into adopters, that is, premium users.
#plot histogran of treatment status to examine the region of common support
labs <- paste("Actual treatment type:", c("Treatment w/ subscriber friends", "Control w/n subscriber friends"))
prs_df %>%
mutate(treatment = ifelse(treatment == 1, labs[1], labs[2])) %>%
ggplot(aes(x = pr_score)) +
geom_histogram(color = "white") +
facet_wrap(~treatment) +
xlab("Probability of getting subscriber friends") +
theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#create a prematching table
xvars <- colnames(df[!colnames(df) %in% c('ID', 'subscriber_friend_cnt', 'adopter', 'treatment')])
table1 <- CreateTableOne(vars = xvars, strata = "treatment", data = df, test = FALSE)
print(table1, smd = TRUE)
## Stratified by treatment
## 0 1
## n 34004 9823
## age (mean (SD)) 23.75 (6.22) 25.37 (6.97)
## male (mean (SD)) 0.63 (0.48) 0.64 (0.48)
## friend_cnt (mean (SD)) 10.43 (15.28) 54.02 (127.91)
## avg_friend_age (mean (SD)) 23.76 (5.06) 25.39 (5.17)
## avg_friend_male (mean (SD)) 0.61 (0.33) 0.64 (0.23)
## friend_country_cnt (mean (SD)) 2.73 (3.10) 9.39 (10.01)
## songsListened (mean (SD)) 14602.22 (23214.29) 33735.64 (43952.34)
## lovedTracks (mean (SD)) 65.21 (181.48) 225.36 (498.23)
## posts (mean (SD)) 2.54 (33.79) 20.52 (241.27)
## playlists (mean (SD)) 0.53 (0.97) 0.74 (1.96)
## shouts (mean (SD)) 16.42 (79.74) 101.82 (739.51)
## tenure (mean (SD)) 43.20 (19.72) 46.55 (19.92)
## good_country (mean (SD)) 0.35 (0.48) 0.34 (0.47)
## Stratified by treatment
## SMD
## n
## age (mean (SD)) 0.246
## male (mean (SD)) 0.015
## friend_cnt (mean (SD)) 0.479
## avg_friend_age (mean (SD)) 0.319
## avg_friend_male (mean (SD)) 0.079
## friend_country_cnt (mean (SD)) 0.899
## songsListened (mean (SD)) 0.544
## lovedTracks (mean (SD)) 0.427
## posts (mean (SD)) 0.104
## playlists (mean (SD)) 0.139
## shouts (mean (SD)) 0.162
## tenure (mean (SD)) 0.169
## good_country (mean (SD)) 0.024
According to the pre-matching table as below, some variables between treatment group and control group are significantly different, in terms of mean difference. (*When SMD of a variable is larger than 0.1, we can say that this variable between two groups are different.)
Propensity Score Matching – Method 1: nearest neighbor matching After finding differe in the treatment status, use MatchIt to find pairs of observations that have very similar propensity scores
With predicted propensity score from logistic regression, I first used the nearest method to match treatment and control samples. The nearest method is to pair a treated subject to a control who is close in terms of its distance in covariate space. The result shows that 9823 samples got matched. For all data without matching, the distances (= propensity score) between treatment and control are 0.4982 and 0.1450. However, the distances of matched data between treatment and control are 0,4982 and 0.3447. The mean difference of distance of the matched data is 0.1535, lower than the distance of all data, 0.3532. There are 56.54% balance improvement of mean different of distance.
#checl null value and omit
sum(is.na(df)) #no null value
## [1] 0
#matchit
mod_match <- matchit(treatment ~ log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1) + log(avg_friend_male+1) + log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) + log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, method = 'nearest', data = df)
summary(mod_match, covariates = T)
##
## Call:
## matchit(formula = treatment ~ log(age) + male + log(friend_cnt +
## 1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) +
## log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks +
## 1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) +
## log(tenure + 1) + good_country, data = df, method = "nearest")
##
## Summary of balance for all data:
## Means Treated Means Control SD Control
## distance 0.4982 0.1450 0.1682
## log(age) 3.2014 3.1389 0.2312
## male 0.6363 0.6288 0.4831
## log(friend_cnt + 1) 3.3261 1.9362 0.9467
## log(avg_friend_age + 1) 3.2562 3.1912 0.1847
## log(avg_friend_male + 1) 0.4814 0.4540 0.2270
## log(friend_country_cnt + 1) 1.9993 1.1226 0.5558
## log(songsListened + 1) 9.6020 7.9441 2.7002
## log(lovedTracks + 1) 3.9582 2.4466 1.9781
## log(posts + 1) 0.8391 0.2865 0.7620
## log(playlists + 1) 0.4130 0.3364 0.3943
## log(shouts + 1) 3.0357 1.7159 1.2392
## log(tenure + 1) 3.7453 3.6546 0.5750
## good_country 0.3433 0.3547 0.4784
## Mean Diff eQQ Med eQQ Mean eQQ Max
## distance 0.3532 0.3898 0.3532 0.5491
## log(age) 0.0625 0.0572 0.0628 0.1823
## male 0.0074 0.0000 0.0074 1.0000
## log(friend_cnt + 1) 1.3899 1.4553 1.3899 2.9793
## log(avg_friend_age + 1) 0.0649 0.0654 0.0651 0.4974
## log(avg_friend_male + 1) 0.0274 0.0465 0.0662 0.3102
## log(friend_country_cnt + 1) 0.8767 1.0116 0.8766 1.2993
## log(songsListened + 1) 1.6579 1.2880 1.6583 6.0283
## log(lovedTracks + 1) 1.5116 1.5315 1.5115 2.8332
## log(posts + 1) 0.5526 0.0000 0.5525 2.3214
## log(playlists + 1) 0.0766 0.0000 0.0765 1.0986
## log(shouts + 1) 1.3198 1.5041 1.3197 2.2849
## log(tenure + 1) 0.0907 0.0770 0.0909 1.6094
## good_country -0.0114 0.0000 0.0114 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control
## distance 0.4982 0.3447 0.1859
## log(age) 3.2014 3.1883 0.2469
## male 0.6363 0.6426 0.4793
## log(friend_cnt + 1) 3.3261 2.8534 0.8519
## log(avg_friend_age + 1) 3.2562 3.2510 0.2004
## log(avg_friend_male + 1) 0.4814 0.4810 0.1552
## log(friend_country_cnt + 1) 1.9993 1.6386 0.6274
## log(songsListened + 1) 9.6020 9.3647 1.7585
## log(lovedTracks + 1) 3.9582 3.5292 1.9237
## log(posts + 1) 0.8391 0.5846 1.0732
## log(playlists + 1) 0.4130 0.3892 0.4143
## log(shouts + 1) 3.0357 2.5247 1.4340
## log(tenure + 1) 3.7453 3.7322 0.5493
## good_country 0.3433 0.3488 0.4766
## Mean Diff eQQ Med eQQ Mean eQQ Max
## distance 0.1535 0.1730 0.1535 0.2964
## log(age) 0.0131 0.0000 0.0148 0.3185
## male -0.0063 0.0000 0.0063 1.0000
## log(friend_cnt + 1) 0.4727 0.4055 0.4727 2.8447
## log(avg_friend_age + 1) 0.0051 0.0185 0.0202 0.2054
## log(avg_friend_male + 1) 0.0004 0.0061 0.0070 0.0496
## log(friend_country_cnt + 1) 0.3607 0.4055 0.3607 1.2040
## log(songsListened + 1) 0.2373 0.2152 0.2377 1.0986
## log(lovedTracks + 1) 0.4290 0.4754 0.4290 0.8702
## log(posts + 1) 0.2545 0.0000 0.2545 1.6584
## log(playlists + 1) 0.0238 0.0000 0.0238 1.6011
## log(shouts + 1) 0.5110 0.5108 0.5110 2.2849
## log(tenure + 1) 0.0131 0.0000 0.0132 1.6094
## good_country -0.0055 0.0000 0.0055 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 56.5380 55.6227 56.5362 46.0241
## log(age) 78.9816 100.0000 76.4853 -74.6660
## male 14.9828 0.0000 15.0685 0.0000
## log(friend_cnt + 1) 65.9902 72.1385 65.9896 4.5196
## log(avg_friend_age + 1) 92.0737 71.7658 68.9221 58.6966
## log(avg_friend_male + 1) 98.4996 86.9324 89.4496 84.0090
## log(friend_country_cnt + 1) 58.8518 59.9185 58.8500 7.3356
## log(songsListened + 1) 85.6845 83.2950 85.6681 81.7757
## log(lovedTracks + 1) 71.6172 68.9565 71.6162 69.2871
## log(posts + 1) 53.9493 0.0000 53.9396 28.5570
## log(playlists + 1) 68.9717 0.0000 68.9302 -45.7356
## log(shouts + 1) 61.2805 66.0373 61.2778 0.0000
## log(tenure + 1) 85.5660 100.0000 85.4254 0.0000
## good_country 51.8523 0.0000 51.7857 0.0000
##
## Sample sizes:
## Control Treated
## All 34004 9823
## Matched 9823 9823
## Unmatched 24181 0
## Discarded 0 0
#plot(mod_match)
#create df to save match data
dta_m <- match.data(mod_match)
dim(dta_m)
## [1] 19646 19
#create a TableOne to see Standardized Mean Difference
match_table <- CreateTableOne(vars = xvars, strata = "treatment", data = dta_m, test = FALSE)
print(match_table, smd = TRUE)
## Stratified by treatment
## 0 1
## n 9823 9823
## age (mean (SD)) 25.05 (6.99) 25.37 (6.97)
## male (mean (SD)) 0.64 (0.48) 0.64 (0.48)
## friend_cnt (mean (SD)) 23.37 (22.62) 54.02 (127.91)
## avg_friend_age (mean (SD)) 25.38 (6.04) 25.39 (5.17)
## avg_friend_male (mean (SD)) 0.64 (0.24) 0.64 (0.23)
## friend_country_cnt (mean (SD)) 5.32 (4.52) 9.39 (10.01)
## songsListened (mean (SD)) 26514.56 (30768.27) 33735.64 (43952.34)
## lovedTracks (mean (SD)) 132.15 (276.54) 225.36 (498.23)
## posts (mean (SD)) 6.29 (57.19) 20.52 (241.27)
## playlists (mean (SD)) 0.63 (0.94) 0.74 (1.96)
## shouts (mean (SD)) 39.20 (138.80) 101.82 (739.51)
## tenure (mean (SD)) 46.14 (19.84) 46.55 (19.92)
## good_country (mean (SD)) 0.35 (0.48) 0.34 (0.47)
## Stratified by treatment
## SMD
## n
## age (mean (SD)) 0.046
## male (mean (SD)) 0.013
## friend_cnt (mean (SD)) 0.334
## avg_friend_age (mean (SD)) 0.001
## avg_friend_male (mean (SD)) 0.003
## friend_country_cnt (mean (SD)) 0.524
## songsListened (mean (SD)) 0.190
## lovedTracks (mean (SD)) 0.231
## posts (mean (SD)) 0.081
## playlists (mean (SD)) 0.077
## shouts (mean (SD)) 0.118
## tenure (mean (SD)) 0.020
## good_country (mean (SD)) 0.012
#alternative way to see mean difference
df_colnames <- names(dta_m)[c(-1, -8, -14, -17, -18, -19)]
dta_m %>%
group_by(treatment) %>%
select(one_of(df_colnames)) %>%
summarise_all(funs(mean))
## Adding missing grouping variables: `treatment`
## # A tibble: 2 x 14
## treatment age male friend_cnt avg_friend_age avg_friend_male
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 25.1 0.643 23.4 25.4 0.636
## 2 1 25.4 0.636 54.0 25.4 0.636
## # … with 8 more variables: friend_country_cnt <dbl>, songsListened <dbl>,
## # lovedTracks <dbl>, posts <dbl>, playlists <dbl>, shouts <dbl>,
## # tenure <dbl>, good_country <dbl>
#t=test
output = NULL
for (i in 1:13){
name = df_colnames[i]
p_value = t.test(dta_m[, df_colnames[i]] ~ dta_m$treatment)$p.value
output <- rbind(output, data.frame(name, p_value))
}
print(output) #all significant
## name p_value
## 1 age 1.199746e-03
## 2 male 3.569646e-01
## 3 friend_cnt 5.640113e-118
## 4 avg_friend_age 9.326247e-01
## 5 avg_friend_male 8.444775e-01
## 6 friend_country_cnt 9.248697e-282
## 7 songsListened 2.141575e-40
## 8 lovedTracks 1.266920e-58
## 9 posts 1.307063e-08
## 10 playlists 7.480149e-08
## 11 shouts 1.795939e-16
## 12 tenure 1.523959e-01
## 13 good_country 4.180329e-01
# Estimating treatment effects
with(dta_m, t.test(adopter ~ treatment))
##
## Welch Two Sample t-test
##
## data: adopter by treatment
## t = -17.858, df = 18260, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.09569061 -0.07676179
## sample estimates:
## mean in group 0 mean in group 1
## 0.0913163 0.1775425
I still got significant difference of all variables between control group and treatment group after nearest method matching. So does the result of t-test between adopter and treatment. The p-value of t-test is significant. As a result, I chose another matching method, the subclass method, to see if it can balance observed covariates of two groups better.
Visual inspection to examine covariate balance in the matched sample. If the scatter points are located near the trend line, it means the matching is good. Yet the below plots show the nearest method still have improvement of well matching the treatment and control samples in all covariates.
fn_bal <- function(dta, variable) {
dta$variable <- dta[, variable]
support <- c(min(dta$variable), max(dta$variable))
ggplot(dta, aes(x = distance, y = variable, color = treatment)) +
geom_point(alpha = 0.2, size = 1.3) +
geom_smooth(method = "loess", se = F) +
xlab("Propensity score") +
ylab(variable) +
theme_bw() +
ylim(support)
}
grid.arrange(
fn_bal(dta_m, "age"),
fn_bal(dta_m, "male") + theme(legend.position = "none"),
fn_bal(dta_m, "friend_cnt"),
fn_bal(dta_m, "avg_friend_age") + theme(legend.position = "none"),
fn_bal(dta_m, "avg_friend_male"),
fn_bal(dta_m, "friend_country_cnt") + theme(legend.position = "none"),
fn_bal(dta_m, "subscriber_friend_cnt"),
fn_bal(dta_m, "songsListened") + theme(legend.position = "none"),
nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
grid.arrange(
fn_bal(dta_m, "lovedTracks"),
fn_bal(dta_m, "posts") + theme(legend.position = "none"),
fn_bal(dta_m, "playlists"),
fn_bal(dta_m, "shouts") + theme(legend.position = "none"),
fn_bal(dta_m, "adopter"),
fn_bal(dta_m, "tenure") + theme(legend.position = "none"),
fn_bal(dta_m, "good_country"),
nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing missing values (geom_smooth).
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Propensity Score Matching – Method 2: subclassification (subclass) The subclass method is to loosely put “similar’’ observations (both treatment and control) into the same “subclass”. In this method, the treatment and control would be split into 6 subclasses. I decided to select the subclasses with lower mean difference than the value got from the nearest neighbor method, which are subclass 2,3,4,5 and subclass 2, 4 having better performance. I ended up choosing subclass 2, it has pretty low mean difference 0.093 and the mean distances of treatment and control are 0.2532 and 0.2439.
For subclass 2, their treatment and control samples are 5026 and 1637. Looking to the overall performance of all subclasses, we got the mean distances of treatment and control 0.4982 and 0.4792, respectively. The mean difference is 0.0092. It improves balance of two groups by 94.6247%, in terms of their distance (propensity score).
Mean difference for all variables refers to the below table. All the standardized mean differences, SMD, are lower than 0.1. It means that there is no huge difference between treated and control for all variables.
#subclass matching approach using subclass method
mod_match2 <- matchit(treatment ~ log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1) + log(avg_friend_male+1) + log(friend_country_cnt+1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) + log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, method = 'subclass', data = df)
summary(mod_match2, covariates = T)
##
## Call:
## matchit(formula = treatment ~ log(age) + male + log(friend_cnt +
## 1) + log(avg_friend_age + 1) + log(avg_friend_male + 1) +
## log(friend_country_cnt + 1) + log(songsListened + 1) + log(lovedTracks +
## 1) + log(posts + 1) + log(playlists + 1) + log(shouts + 1) +
## log(tenure + 1) + good_country, data = df, method = "subclass",
## sub.by = "treat")
## Summary of balance for all data:
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.4982 0.1450 0.3532 0.3898
## log(age) 3.2014 3.1389 0.0625 0.0572
## male 0.6363 0.6288 0.0074 0.0000
## log(friend_cnt + 1) 3.3261 1.9362 1.3899 1.4553
## log(avg_friend_age + 1) 3.2562 3.1912 0.0649 0.0654
## log(avg_friend_male + 1) 0.4814 0.4540 0.0274 0.0465
## log(friend_country_cnt + 1) 1.9993 1.1226 0.8767 1.0116
## log(songsListened + 1) 9.6020 7.9441 1.6579 1.2880
## log(lovedTracks + 1) 3.9582 2.4466 1.5116 1.5315
## log(posts + 1) 0.8391 0.2865 0.5526 0.0000
## log(playlists + 1) 0.4130 0.3364 0.0766 0.0000
## log(shouts + 1) 3.0357 1.7159 1.3198 1.5041
## log(tenure + 1) 3.7453 3.6546 0.0907 0.0770
## good_country 0.3433 0.3547 -0.0114 0.0000
## eQQ Mean eQQ Max
## distance 0.3532 0.5491
## log(age) 0.0628 0.1823
## male 0.0074 1.0000
## log(friend_cnt + 1) 1.3899 2.9793
## log(avg_friend_age + 1) 0.0651 0.4974
## log(avg_friend_male + 1) 0.0662 0.3102
## log(friend_country_cnt + 1) 0.8766 1.2993
## log(songsListened + 1) 1.6583 6.0283
## log(lovedTracks + 1) 1.5115 2.8332
## log(posts + 1) 0.5525 2.3214
## log(playlists + 1) 0.0765 1.0986
## log(shouts + 1) 1.3197 2.2849
## log(tenure + 1) 0.0909 1.6094
## good_country 0.0114 1.0000
##
##
## Summary of balance by subclasses:
## , , Subclass 1
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.0958 0.0590 0.0368 0.0413
## log(age) 3.1476 3.1177 0.0299 0.0392
## male 0.6512 0.6226 0.0286 0.0000
## log(friend_cnt + 1) 1.9106 1.5368 0.3738 0.4055
## log(avg_friend_age + 1) 3.2050 3.1660 0.0389 0.0396
## log(avg_friend_male + 1) 0.4690 0.4436 0.0254 0.0155
## log(friend_country_cnt + 1) 1.0560 0.9001 0.1559 0.0000
## log(songsListened + 1) 8.2132 7.3336 0.8796 0.6968
## log(lovedTracks + 1) 2.5337 1.9785 0.5552 0.6022
## log(posts + 1) 0.2298 0.1579 0.0718 0.0000
## log(playlists + 1) 0.3265 0.3153 0.0112 0.0000
## log(shouts + 1) 1.6350 1.3627 0.2724 0.3365
## log(tenure + 1) 3.6255 3.6195 0.0060 0.0000
## good_country 0.3531 0.3566 -0.0035 0.0000
## eQQ Mean eQQ Max
## distance 0.0367 0.0517
## log(age) 0.0314 0.3090
## male 0.0287 1.0000
## log(friend_cnt + 1) 0.3739 0.6931
## log(avg_friend_age + 1) 0.0396 0.4974
## log(avg_friend_male + 1) 0.0468 0.2877
## log(friend_country_cnt + 1) 0.1560 0.6931
## log(songsListened + 1) 0.8811 3.4340
## log(lovedTracks + 1) 0.5546 1.3863
## log(posts + 1) 0.0730 2.4556
## log(playlists + 1) 0.0138 2.1972
## log(shouts + 1) 0.2711 0.8548
## log(tenure + 1) 0.0127 0.9163
## good_country 0.0037 1.0000
##
## , , Subclass 2
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.2532 0.2439 0.0093 0.0103
## log(age) 3.1820 3.1865 -0.0045 0.0000
## male 0.6402 0.6546 -0.0144 0.0000
## log(friend_cnt + 1) 2.6798 2.6449 0.0349 0.0000
## log(avg_friend_age + 1) 3.2496 3.2505 -0.0009 0.0147
## log(avg_friend_male + 1) 0.4769 0.4807 -0.0039 0.0026
## log(friend_country_cnt + 1) 1.4616 1.4356 0.0260 0.0000
## log(songsListened + 1) 9.3979 9.3021 0.0958 0.0589
## log(lovedTracks + 1) 3.3213 3.3103 0.0110 0.0198
## log(posts + 1) 0.4710 0.4618 0.0092 0.0000
## log(playlists + 1) 0.3764 0.3716 0.0048 0.0000
## log(shouts + 1) 2.2785 2.2897 -0.0112 0.0000
## log(tenure + 1) 3.7383 3.7424 -0.0041 0.0126
## good_country 0.3561 0.3703 -0.0141 0.0000
## eQQ Mean eQQ Max
## distance 0.0094 0.0144
## log(age) 0.0194 0.1671
## male 0.0141 1.0000
## log(friend_cnt + 1) 0.0442 0.4055
## log(avg_friend_age + 1) 0.0205 0.2336
## log(avg_friend_male + 1) 0.0056 0.0760
## log(friend_country_cnt + 1) 0.0327 0.6931
## log(songsListened + 1) 0.0988 2.5649
## log(lovedTracks + 1) 0.0327 0.6931
## log(posts + 1) 0.0218 0.8694
## log(playlists + 1) 0.0199 0.6931
## log(shouts + 1) 0.0355 0.6931
## log(tenure + 1) 0.0164 0.2513
## good_country 0.0141 1.0000
##
## , , Subclass 3
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.4145 0.4048 0.0096 0.0103
## log(age) 3.2024 3.2001 0.0022 0.0000
## male 0.6512 0.6352 0.0160 0.0000
## log(friend_cnt + 1) 3.1062 3.0719 0.0343 0.0465
## log(avg_friend_age + 1) 3.2638 3.2646 -0.0008 0.0203
## log(avg_friend_male + 1) 0.4812 0.4860 -0.0048 0.0036
## log(friend_country_cnt + 1) 1.7901 1.7722 0.0179 0.0000
## log(songsListened + 1) 9.6568 9.6260 0.0309 0.0178
## log(lovedTracks + 1) 3.7538 3.7950 -0.0412 0.0343
## log(posts + 1) 0.6524 0.6417 0.0107 0.0000
## log(playlists + 1) 0.4014 0.4104 -0.0090 0.0000
## log(shouts + 1) 2.7325 2.7427 -0.0102 0.0408
## log(tenure + 1) 3.7625 3.7453 0.0172 0.0000
## good_country 0.3470 0.3370 0.0100 0.0000
## eQQ Mean eQQ Max
## distance 0.0097 0.0170
## log(age) 0.0222 0.2624
## male 0.0165 1.0000
## log(friend_cnt + 1) 0.0641 0.4055
## log(avg_friend_age + 1) 0.0251 0.1974
## log(avg_friend_male + 1) 0.0053 0.0572
## log(friend_country_cnt + 1) 0.0433 0.6931
## log(songsListened + 1) 0.0476 1.9459
## log(lovedTracks + 1) 0.0777 0.6931
## log(posts + 1) 0.0176 0.6931
## log(playlists + 1) 0.0208 0.6931
## log(shouts + 1) 0.0589 0.6931
## log(tenure + 1) 0.0257 1.6094
## good_country 0.0104 1.0000
##
## , , Subclass 4
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.5774 0.5681 0.0092 0.0098
## log(age) 3.2074 3.2117 -0.0044 0.0339
## male 0.6420 0.6427 -0.0007 0.0000
## log(friend_cnt + 1) 3.4863 3.4497 0.0366 0.0385
## log(avg_friend_age + 1) 3.2712 3.2716 -0.0004 0.0287
## log(avg_friend_male + 1) 0.4885 0.4772 0.0114 0.0109
## log(friend_country_cnt + 1) 2.1122 2.0989 0.0133 0.0000
## log(songsListened + 1) 9.8758 9.8732 0.0026 0.0187
## log(lovedTracks + 1) 4.2332 4.2037 0.0295 0.0744
## log(posts + 1) 0.9157 0.8223 0.0933 0.0000
## log(playlists + 1) 0.4235 0.4083 0.0152 0.0000
## log(shouts + 1) 3.1661 3.0812 0.0848 0.0690
## log(tenure + 1) 3.7982 3.7593 0.0389 0.0308
## good_country 0.3238 0.3213 0.0024 0.0000
## eQQ Mean eQQ Max
## distance 0.0092 0.0179
## log(age) 0.0347 0.4055
## male 0.0008 1.0000
## log(friend_cnt + 1) 0.0767 0.8473
## log(avg_friend_age + 1) 0.0356 0.3167
## log(avg_friend_male + 1) 0.0131 0.0796
## log(friend_country_cnt + 1) 0.0501 0.6931
## log(songsListened + 1) 0.0423 0.8708
## log(lovedTracks + 1) 0.0936 0.8702
## log(posts + 1) 0.0927 0.6931
## log(playlists + 1) 0.0147 0.6931
## log(shouts + 1) 0.0827 0.6931
## log(tenure + 1) 0.0380 0.3185
## good_country 0.0024 1.0000
##
## , , Subclass 5
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.7421 0.7291 0.0130 0.0147
## log(age) 3.2141 3.1921 0.0220 0.0392
## male 0.6115 0.6030 0.0085 0.0000
## log(friend_cnt + 1) 3.9658 3.9556 0.0102 0.0308
## log(avg_friend_age + 1) 3.2691 3.2461 0.0230 0.0392
## log(avg_friend_male + 1) 0.4820 0.4703 0.0117 0.0140
## log(friend_country_cnt + 1) 2.4754 2.5116 -0.0362 0.0426
## log(songsListened + 1) 10.1379 10.0651 0.0728 0.0616
## log(lovedTracks + 1) 4.6110 4.5637 0.0473 0.1342
## log(posts + 1) 1.1444 1.2230 -0.0786 0.0000
## log(playlists + 1) 0.4371 0.4315 0.0056 0.0000
## log(shouts + 1) 3.7873 3.7217 0.0656 0.0392
## log(tenure + 1) 3.7989 3.7379 0.0609 0.0551
## good_country 0.3445 0.2990 0.0455 0.0000
## eQQ Mean eQQ Max
## distance 0.0130 0.0196
## log(age) 0.0301 0.3365
## male 0.0083 1.0000
## log(friend_cnt + 1) 0.0481 0.4055
## log(avg_friend_age + 1) 0.0339 0.1603
## log(avg_friend_male + 1) 0.0125 0.0509
## log(friend_country_cnt + 1) 0.0435 0.4055
## log(songsListened + 1) 0.0979 3.2581
## log(lovedTracks + 1) 0.1446 1.0986
## log(posts + 1) 0.1087 1.4976
## log(playlists + 1) 0.0187 0.8910
## log(shouts + 1) 0.0819 1.1214
## log(tenure + 1) 0.0595 0.2683
## good_country 0.0449 1.0000
##
## , , Subclass 6
##
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.9058 0.8699 0.0359 0.0376
## log(age) 3.2550 3.1733 0.0817 0.0834
## male 0.6215 0.6752 -0.0537 0.0000
## log(friend_cnt + 1) 4.8071 4.5569 0.2502 0.1363
## log(avg_friend_age + 1) 3.2785 3.2334 0.0451 0.0490
## log(avg_friend_male + 1) 0.4910 0.4813 0.0097 0.0085
## log(friend_country_cnt + 1) 3.0998 2.9599 0.1399 0.1092
## log(songsListened + 1) 10.3301 10.3882 -0.0581 0.0674
## log(lovedTracks + 1) 5.2952 5.2018 0.0934 0.2087
## log(posts + 1) 1.6209 1.6282 -0.0073 0.0000
## log(playlists + 1) 0.5132 0.4153 0.0979 0.0000
## log(shouts + 1) 4.6137 4.4246 0.1892 0.1565
## log(tenure + 1) 3.7482 3.7621 -0.0139 0.0339
## good_country 0.3352 0.3077 0.0275 0.0000
## eQQ Mean eQQ Max
## distance 0.0360 0.0657
## log(age) 0.0862 0.3185
## male 0.0513 1.0000
## log(friend_cnt + 1) 0.2706 2.8447
## log(avg_friend_age + 1) 0.0522 0.2537
## log(avg_friend_male + 1) 0.0105 0.0521
## log(friend_country_cnt + 1) 0.1396 1.1823
## log(songsListened + 1) 0.1566 5.8493
## log(lovedTracks + 1) 0.2823 1.7918
## log(posts + 1) 0.1632 1.4898
## log(playlists + 1) 0.1159 3.3928
## log(shouts + 1) 0.2248 2.5425
## log(tenure + 1) 0.0423 0.3365
## good_country 0.0256 1.0000
##
##
## Sample sizes by subclasses:
## Subclass 1 Subclass 2 Subclass 3 Subclass 4 Subclass 5 Subclass 6
## Treated 1637 1637 1637 1637 1637 1638
## Control 24527 5026 2481 1251 602 117
## Total 26164 6663 4118 2888 2239 1755
##
## Summary of balance across subclasses
## Means Treated Means Control Mean Diff eQQ Med
## distance 0.4982 0.4792 0.0092 0.0207
## log(age) 3.2014 3.1803 0.0150 0.0326
## male 0.6363 0.6389 0.0109 0.0000
## log(friend_cnt + 1) 3.3261 3.2028 0.0757 0.1096
## log(avg_friend_age + 1) 3.2562 3.2387 0.0106 0.0319
## log(avg_friend_male + 1) 0.4814 0.4732 0.0054 0.0092
## log(friend_country_cnt + 1) 1.9993 1.9465 0.0359 0.0253
## log(songsListened + 1) 9.6020 9.4315 0.1484 0.1535
## log(lovedTracks + 1) 3.9582 3.8423 0.0946 0.1789
## log(posts + 1) 0.8391 0.8226 0.0237 0.0000
## log(playlists + 1) 0.4130 0.3921 0.0167 0.0000
## log(shouts + 1) 3.0357 2.9372 0.0581 0.1070
## log(tenure + 1) 3.7453 3.7278 0.0127 0.0221
## good_country 0.3433 0.3320 0.0093 0.0000
## eQQ Mean eQQ Max
## distance 0.0190 0.0311
## log(age) 0.0373 0.2998
## male 0.0199 1.0000
## log(friend_cnt + 1) 0.1463 0.9338
## log(avg_friend_age + 1) 0.0345 0.2765
## log(avg_friend_male + 1) 0.0156 0.1006
## log(friend_country_cnt + 1) 0.0775 0.7268
## log(songsListened + 1) 0.2207 2.9875
## log(lovedTracks + 1) 0.1976 1.0889
## log(posts + 1) 0.0795 1.2831
## log(playlists + 1) 0.0340 1.4269
## log(shouts + 1) 0.1258 1.0998
## log(tenure + 1) 0.0324 0.6167
## good_country 0.0168 1.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 94.6247 94.6950 94.6203 94.3445
## log(age) 66.1718 42.9202 40.5834 -64.4371
## male 64.5644 0.0000 -168.3629 0.0000
## log(friend_cnt + 1) 91.1270 92.4691 89.4753 68.6580
## log(avg_friend_age + 1) 73.0777 51.2211 47.0466 44.4094
## log(avg_friend_male + 1) 69.9300 80.2584 76.3880 67.5745
## log(friend_country_cnt + 1) 93.9768 97.4988 91.1556 44.0635
## log(songsListened + 1) 89.7115 88.0812 86.6907 50.4423
## log(lovedTracks + 1) 92.3338 88.3170 86.9274 61.5657
## log(posts + 1) 97.0081 0.0000 85.6106 44.7252
## log(playlists + 1) 72.6598 0.0000 55.6367 -29.8861
## log(shouts + 1) 92.5414 92.8857 90.4645 51.8639
## log(tenure + 1) 80.7125 71.3450 64.3242 61.6837
## good_country 1.0151 0.0000 -47.6303 0.0000
#plot(mod_match2)
dta_m2 <- match.data(mod_match2)
dta_m2 <- filter(dta_m2, subclass == 2)
#create a TableOne to see Standardized Mean Difference
match_table <- CreateTableOne(vars = xvars, strata = "treatment", data = dta_m2, test = FALSE)
print(match_table, smd = TRUE)
## Stratified by treatment
## 0 1
## n 5026 1637
## age (mean (SD)) 24.96 (6.79) 24.72 (6.02)
## male (mean (SD)) 0.65 (0.48) 0.64 (0.48)
## friend_cnt (mean (SD)) 15.55 (8.93) 15.81 (8.79)
## avg_friend_age (mean (SD)) 25.33 (5.70) 25.17 (4.83)
## avg_friend_male (mean (SD)) 0.64 (0.24) 0.63 (0.24)
## friend_country_cnt (mean (SD)) 3.67 (2.22) 3.75 (2.17)
## songsListened (mean (SD)) 23405.49 (26688.87) 24706.95 (28220.07)
## lovedTracks (mean (SD)) 98.99 (200.87) 103.87 (243.31)
## posts (mean (SD)) 3.55 (29.95) 3.43 (23.22)
## playlists (mean (SD)) 0.58 (0.76) 0.63 (1.13)
## shouts (mean (SD)) 24.26 (69.94) 26.55 (84.56)
## tenure (mean (SD)) 46.32 (19.65) 46.42 (19.99)
## good_country (mean (SD)) 0.37 (0.48) 0.36 (0.48)
## Stratified by treatment
## SMD
## n
## age (mean (SD)) 0.039
## male (mean (SD)) 0.030
## friend_cnt (mean (SD)) 0.029
## avg_friend_age (mean (SD)) 0.029
## avg_friend_male (mean (SD)) 0.024
## friend_country_cnt (mean (SD)) 0.036
## songsListened (mean (SD)) 0.047
## lovedTracks (mean (SD)) 0.022
## posts (mean (SD)) 0.004
## playlists (mean (SD)) 0.052
## shouts (mean (SD)) 0.029
## tenure (mean (SD)) 0.005
## good_country (mean (SD)) 0.029
#alternative way to see mean difference
df_colnames <- names(dta_m2)[c(-1, -8, -14, -17, -18, -19)]
dta_m2 %>%
group_by(treatment) %>%
select(one_of(df_colnames)) %>%
summarise_all(funs(mean))
## Adding missing grouping variables: `treatment`
## # A tibble: 2 x 15
## treatment age male friend_cnt avg_friend_age avg_friend_male
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 25.0 0.655 15.5 25.3 0.635
## 2 1 24.7 0.640 15.8 25.2 0.629
## # … with 9 more variables: friend_country_cnt <dbl>, songsListened <dbl>,
## # lovedTracks <dbl>, posts <dbl>, playlists <dbl>, shouts <dbl>,
## # tenure <dbl>, good_country <dbl>, subclass <dbl>
#t=test
output = NULL
for (i in 1:13){
name = df_colnames[i]
p_value = t.test(dta_m2[, df_colnames[i]] ~ dta_m2$treatment)$p.value
output <- rbind(output, data.frame(name, p_value))
}
print(output)
## name p_value
## 1 age 0.16059481
## 2 male 0.29083550
## 3 friend_cnt 0.30200462
## 4 avg_friend_age 0.28573427
## 5 avg_friend_male 0.39390220
## 6 friend_country_cnt 0.20644393
## 7 songsListened 0.10070373
## 8 lovedTracks 0.46305805
## 9 posts 0.87305044
## 10 playlists 0.09660396
## 11 shouts 0.32230035
## 12 tenure 0.87002629
## 13 good_country 0.30081295
In terms of t-test, all p-value in the subclass2 are not statistically significant (>0.05). So, there is no different between treatment and control group, which means that the observed covariates of two groups are balance.
Visual Inspection: most of points are located near the trend line
#Examining covariate balance in the matched sample
grid.arrange(
fn_bal(dta_m2, "age"),
fn_bal(dta_m2, "male") + theme(legend.position = "none"),
fn_bal(dta_m2, "friend_cnt"),
fn_bal(dta_m2, "avg_friend_age") + theme(legend.position = "none"),
fn_bal(dta_m2, "avg_friend_male"),
fn_bal(dta_m2, "friend_country_cnt") + theme(legend.position = "none"),
fn_bal(dta_m2, "subscriber_friend_cnt"),
fn_bal(dta_m2, "songsListened") + theme(legend.position = "none"),
nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
grid.arrange(
fn_bal(dta_m2, "lovedTracks"),
fn_bal(dta_m2, "posts") + theme(legend.position = "none"),
fn_bal(dta_m2, "playlists"),
fn_bal(dta_m2, "shouts") + theme(legend.position = "none"),
fn_bal(dta_m2, "adopter"),
fn_bal(dta_m2, "tenure") + theme(legend.position = "none"),
fn_bal(dta_m2, "good_country"),
nrow = 4, widths = c(10, 8)
)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
#put log data back to df
#for (col in df_colnames) {
# tmp = mutate(dta_m, col = log(dta_m[,col]))
#}
dta_m = mutate(dta_m, age_log = log(dta_m[,'age']+1))
dta_m = mutate(dta_m, male_new = dta_m[,'male'])
dta_m = mutate(dta_m, friend_cnt_log = log(dta_m[,'friend_cnt'])+1)
dta_m = mutate(dta_m, avg_friend_age_log = log(dta_m[,'avg_friend_age']+1))
dta_m = mutate(dta_m, avg_friend_male_log = log(dta_m[,'avg_friend_male']+1))
dta_m = mutate(dta_m, friend_country_cnt_log = log(dta_m[,'friend_country_cnt']+1))
dta_m = mutate(dta_m, songsListened_log = log(dta_m[,'songsListened']+1))
dta_m = mutate(dta_m, lovedTracks_log = log(dta_m[,'lovedTracks']+1))
dta_m = mutate(dta_m, posts_log = log(dta_m[,'posts']+1))
dta_m = mutate(dta_m, playlists_log = log(dta_m[,'playlists']+1))
dta_m = mutate(dta_m, shouts_log = log(dta_m[,'shouts']+1))
dta_m = mutate(dta_m, tenure_log = log(dta_m[,'tenure']+1))
dta_m = mutate(dta_m, good_country_new = dta_m[,'good_country'])
dta_m2 = mutate(dta_m2, age_log = log(dta_m2[,'age']+1))
dta_m2 = mutate(dta_m2, male_new = dta_m2[,'male'])
dta_m2 = mutate(dta_m2, friend_cnt_log = log(dta_m2[,'friend_cnt'])+1)
dta_m2 = mutate(dta_m2, avg_friend_age_log = log(dta_m2[,'avg_friend_age']+1))
dta_m2 = mutate(dta_m2, avg_friend_male_log = log(dta_m2[,'avg_friend_male']+1))
dta_m2 = mutate(dta_m2, friend_country_cnt_log = log(dta_m2[,'friend_country_cnt']+1))
dta_m2 = mutate(dta_m2, songsListened_log = log(dta_m2[,'songsListened']+1))
dta_m2 = mutate(dta_m2, lovedTracks_log = log(dta_m2[,'lovedTracks']+1))
dta_m2 = mutate(dta_m2, posts_log = log(dta_m2[,'posts']+1))
dta_m2 = mutate(dta_m2, playlists_log = log(dta_m2[,'playlists']+1))
dta_m2 = mutate(dta_m2, shouts_log = log(dta_m2[,'shouts']+1))
dta_m2 = mutate(dta_m2, tenure_log = log(dta_m2[,'tenure']+1))
dta_m2 = mutate(dta_m2, good_country_new = dta_m2[,'good_country'])
#mean differences of two natching methods
cols <- colnames(dta_m)[20:32]
#1
match_table <- CreateTableOne(vars = cols, strata = "treatment", data = dta_m, test = FALSE)
print(match_table, smd = TRUE)
## Stratified by treatment
## 0 1 SMD
## n 9823 9823
## age_log (mean (SD)) 3.23 (0.24) 3.24 (0.24) 0.053
## male_new (mean (SD)) 0.64 (0.48) 0.64 (0.48) 0.013
## friend_cnt_log (mean (SD)) 3.76 (0.95) 4.25 (1.19) 0.460
## avg_friend_age_log (mean (SD)) 3.25 (0.20) 3.26 (0.18) 0.027
## avg_friend_male_log (mean (SD)) 0.48 (0.16) 0.48 (0.15) 0.003
## friend_country_cnt_log (mean (SD)) 1.64 (0.63) 2.00 (0.81) 0.498
## songsListened_log (mean (SD)) 9.36 (1.76) 9.60 (1.75) 0.135
## lovedTracks_log (mean (SD)) 3.53 (1.92) 3.96 (2.05) 0.216
## posts_log (mean (SD)) 0.58 (1.07) 0.84 (1.36) 0.208
## playlists_log (mean (SD)) 0.39 (0.41) 0.41 (0.46) 0.054
## shouts_log (mean (SD)) 2.52 (1.43) 3.04 (1.69) 0.326
## tenure_log (mean (SD)) 3.73 (0.55) 3.75 (0.53) 0.024
## good_country_new (mean (SD)) 0.35 (0.48) 0.34 (0.47) 0.012
#
#dta_m %>%
# group_by(treatment) %>%
# select(one_of(cols)) %>%
# summarise_all(funs(mean))
#2
match_table <- CreateTableOne(vars = cols, strata = "treatment", data = dta_m2, test = FALSE)
print(match_table, smd = TRUE)
## Stratified by treatment
## 0 1 SMD
## n 5026 1637
## age_log (mean (SD)) 3.23 (0.23) 3.22 (0.21) 0.020
## male_new (mean (SD)) 0.65 (0.48) 0.64 (0.48) 0.030
## friend_cnt_log (mean (SD)) 3.55 (0.69) 3.59 (0.63) 0.064
## avg_friend_age_log (mean (SD)) 3.25 (0.19) 3.25 (0.17) 0.005
## avg_friend_male_log (mean (SD)) 0.48 (0.15) 0.48 (0.15) 0.025
## friend_country_cnt_log (mean (SD)) 1.44 (0.46) 1.46 (0.44) 0.058
## songsListened_log (mean (SD)) 9.30 (1.67) 9.40 (1.58) 0.059
## lovedTracks_log (mean (SD)) 3.31 (1.86) 3.32 (1.88) 0.006
## posts_log (mean (SD)) 0.46 (0.92) 0.47 (0.91) 0.010
## playlists_log (mean (SD)) 0.37 (0.40) 0.38 (0.43) 0.011
## shouts_log (mean (SD)) 2.29 (1.27) 2.28 (1.30) 0.009
## tenure_log (mean (SD)) 3.74 (0.53) 3.74 (0.55) 0.008
## good_country_new (mean (SD)) 0.37 (0.48) 0.36 (0.48) 0.029
treatment effect Finally, I used logistic regression to double check the result of the models to see if the treatment and control samples got balanced.
#t-test
output = NULL
for (i in 1:13){
name = cols[i]
p_value = t.test(dta_m[, cols[i]] ~ dta_m$treatment)$p.value
output <- rbind(output, data.frame(name, p_value))
}
print(output) #all significant
## name p_value
## 1 age_log 1.979079e-04
## 2 male_new 3.569646e-01
## 3 friend_cnt_log 7.789995e-222
## 4 avg_friend_age_log 5.738469e-02
## 5 avg_friend_male_log 8.501290e-01
## 6 friend_country_cnt_log 4.196158e-258
## 7 songsListened_log 3.106254e-21
## 8 lovedTracks_log 1.843262e-51
## 9 posts_log 9.008717e-48
## 10 playlists_log 1.459077e-04
## 11 shouts_log 3.699910e-114
## 12 tenure_log 9.003016e-02
## 13 good_country_new 4.180329e-01
#treatment effect
lm_test1 <- glm(adopter~treatment, data = dta_m, family = binomial(), )
summary(lm_test1)
##
## Call:
## glm(formula = adopter ~ treatment, family = binomial(), data = dta_m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6252 -0.6252 -0.4376 -0.4376 2.1879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.29767 0.03503 -65.60 <2e-16 ***
## treatment 0.76458 0.04386 17.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15509 on 19645 degrees of freedom
## Residual deviance: 15191 on 19644 degrees of freedom
## AIC: 15195
##
## Number of Fisher Scoring iterations: 5
lm_test2 <- glm(adopter~treatment + age_log + male_new + friend_cnt_log + avg_friend_age_log
+ avg_friend_male_log + friend_country_cnt_log + songsListened_log + lovedTracks_log
+ posts_log + playlists_log + shouts_log + tenure_log, data = dta_m, family = binomial())
summary(lm_test2)
##
## Call:
## glm(formula = adopter ~ treatment + age_log + male_new + friend_cnt_log +
## avg_friend_age_log + avg_friend_male_log + friend_country_cnt_log +
## songsListened_log + lovedTracks_log + posts_log + playlists_log +
## shouts_log + tenure_log, family = binomial(), data = dta_m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6856 -0.5752 -0.4304 -0.2971 2.9661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.242989 0.501279 -18.439 < 2e-16 ***
## treatment 0.578626 0.047585 12.160 < 2e-16 ***
## age_log 0.608481 0.148409 4.100 4.13e-05 ***
## male_new 0.261536 0.049733 5.259 1.45e-07 ***
## friend_cnt_log 0.070675 0.038618 1.830 0.06723 .
## avg_friend_age_log 0.873225 0.201824 4.327 1.51e-05 ***
## avg_friend_male_log 0.182963 0.156736 1.167 0.24308
## friend_country_cnt_log 0.006577 0.048278 0.136 0.89164
## songsListened_log 0.204511 0.019736 10.362 < 2e-16 ***
## lovedTracks_log 0.251007 0.013486 18.613 < 2e-16 ***
## posts_log 0.128037 0.018053 7.092 1.32e-12 ***
## playlists_log 0.154962 0.048393 3.202 0.00136 **
## shouts_log -0.085773 0.019328 -4.438 9.09e-06 ***
## tenure_log -0.349142 0.048227 -7.240 4.50e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15509 on 19645 degrees of freedom
## Residual deviance: 14115 on 19632 degrees of freedom
## AIC: 14143
##
## Number of Fisher Scoring iterations: 5
#treatment effect
lm_test3 <- glm(adopter~treatment, data = dta_m2, family = binomial(), )
summary(lm_test3)
##
## Call:
## glm(formula = adopter ~ treatment, family = binomial(), data = dta_m2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5682 -0.4025 -0.4025 -0.4025 2.2599
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47268 0.05266 -46.954 <2e-16 ***
## treatment 0.73064 0.08712 8.387 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4192.9 on 6662 degrees of freedom
## Residual deviance: 4126.3 on 6661 degrees of freedom
## AIC: 4130.3
##
## Number of Fisher Scoring iterations: 5
lm_test4 <- glm(adopter~treatment + age_log + male_new + friend_cnt_log + avg_friend_age_log
+ avg_friend_male_log + friend_country_cnt_log + songsListened_log + lovedTracks_log
+ posts_log + playlists_log + shouts_log + tenure_log, data = dta_m2, family = binomial())
summary(lm_test4)
##
## Call:
## glm(formula = adopter ~ treatment + age_log + male_new + friend_cnt_log +
## avg_friend_age_log + avg_friend_male_log + friend_country_cnt_log +
## songsListened_log + lovedTracks_log + posts_log + playlists_log +
## shouts_log + tenure_log, family = binomial(), data = dta_m2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5171 -0.4840 -0.3640 -0.2601 2.9942
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.06130 3.18160 -5.362 8.21e-08 ***
## treatment 0.71601 0.09032 7.927 2.24e-15 ***
## age_log 0.92899 0.32490 2.859 0.004245 **
## male_new 0.27716 0.10186 2.721 0.006507 **
## friend_cnt_log 0.55987 0.18248 3.068 0.002154 **
## avg_friend_age_log 2.42208 0.73048 3.316 0.000914 ***
## avg_friend_male_log 0.50759 0.29989 1.693 0.090532 .
## friend_country_cnt_log 0.23115 0.13531 1.708 0.087571 .
## songsListened_log 0.22339 0.03900 5.728 1.01e-08 ***
## lovedTracks_log 0.33988 0.03015 11.272 < 2e-16 ***
## posts_log 0.13697 0.05079 2.697 0.006999 **
## playlists_log 0.29315 0.10631 2.757 0.005827 **
## shouts_log -0.21240 0.04138 -5.133 2.86e-07 ***
## tenure_log -0.57488 0.11308 -5.084 3.70e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4192.9 on 6662 degrees of freedom
## Residual deviance: 3832.6 on 6649 degrees of freedom
## AIC: 3860.6
##
## Number of Fisher Scoring iterations: 6
It shows that when only looking to the relationship between adopter and treatment, the p-value is significant. The coefficient of treatment variable is 0.73064, positively correlated with the adopter. However, if I put all the other variables (demographic variables, peer influence, user engagement variables) into logistic regression, the coefficient of treatment is still 0.71 or so, close to 0.73. Therefore, it means that whether or not I put other variables, the correlation of adopter and treatment is not affected. The treatment effect is solved.
#4. Regression Analyses #logistic regression
#put all variables into the regression
lm1<- lm(adopter~ log(age) + male +log(friend_cnt+1) + log(avg_friend_age+1) + log(avg_friend_male+1) +
log(friend_country_cnt+1) + log(subscriber_friend_cnt + 1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, data = df, family = binomial())
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'family' will be disregarded
summary(lm1)
##
## Call:
## lm(formula = adopter ~ log(age) + male + log(friend_cnt + 1) +
## log(avg_friend_age + 1) + log(avg_friend_male + 1) + log(friend_country_cnt +
## 1) + log(subscriber_friend_cnt + 1) + log(songsListened +
## 1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists +
## 1) + log(shouts + 1) + log(tenure + 1) + good_country, data = df,
## family = binomial())
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65446 -0.10987 -0.05571 -0.00430 1.06422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2652221 0.0232335 -11.416 < 2e-16 ***
## log(age) 0.0625975 0.0079841 7.840 4.60e-15 ***
## male 0.0225384 0.0026948 8.364 < 2e-16 ***
## log(friend_cnt + 1) 0.0005592 0.0023392 0.239 0.8111
## log(avg_friend_age + 1) 0.0243932 0.0102037 2.391 0.0168 *
## log(avg_friend_male + 1) 0.0031592 0.0060004 0.526 0.5985
## log(friend_country_cnt + 1) -0.0014172 0.0032037 -0.442 0.6582
## log(subscriber_friend_cnt + 1) 0.0912123 0.0033791 26.993 < 2e-16 ***
## log(songsListened + 1) 0.0055781 0.0006097 9.150 < 2e-16 ***
## log(lovedTracks + 1) 0.0203240 0.0007199 28.232 < 2e-16 ***
## log(posts + 1) 0.0161201 0.0015031 10.724 < 2e-16 ***
## log(playlists + 1) 0.0150537 0.0031679 4.752 2.02e-06 ***
## log(shouts + 1) -0.0095469 0.0013196 -7.235 4.74e-13 ***
## log(tenure + 1) -0.0138152 0.0025582 -5.400 6.68e-08 ***
## good_country -0.0294464 0.0026364 -11.169 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.259 on 43812 degrees of freedom
## Multiple R-squared: 0.09358, Adjusted R-squared: 0.09329
## F-statistic: 323.1 on 14 and 43812 DF, p-value: < 2.2e-16
First, use all variables (continuous variables is taken log transformation) into logistic regression. Adopter is dependent variable. The summary result shows that friend_cnt, avg_friend_male_log and friend_country_cnt_log are not significant, based on the alpha = 0.05. So, I removed these two variables in the second logistic regression model.
In the second model, all variables are statistically significant.
#remove non-significant variables
lm2<- lm(adopter~ log(age) + male + log(avg_friend_age+1) + log(subscriber_friend_cnt + 1) + log(songsListened+1) + log(lovedTracks+1) + log(posts+1) +log(playlists+1) + log(shouts+1) + log(tenure+1) + good_country, data = df, family = binomial())
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'family' will be disregarded
summary(lm2)
##
## Call:
## lm(formula = adopter ~ log(age) + male + log(avg_friend_age +
## 1) + log(subscriber_friend_cnt + 1) + log(songsListened +
## 1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists +
## 1) + log(shouts + 1) + log(tenure + 1) + good_country, data = df,
## family = binomial())
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65458 -0.10992 -0.05563 -0.00434 1.06461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2659995 0.0228422 -11.645 < 2e-16 ***
## log(age) 0.0621742 0.0079405 7.830 4.99e-15 ***
## male 0.0226576 0.0026892 8.425 < 2e-16 ***
## log(avg_friend_age + 1) 0.0252542 0.0099806 2.530 0.0114 *
## log(subscriber_friend_cnt + 1) 0.0909001 0.0030221 30.078 < 2e-16 ***
## log(songsListened + 1) 0.0056192 0.0005777 9.726 < 2e-16 ***
## log(lovedTracks + 1) 0.0202868 0.0007079 28.658 < 2e-16 ***
## log(posts + 1) 0.0161080 0.0015013 10.730 < 2e-16 ***
## log(playlists + 1) 0.0149707 0.0031653 4.730 2.26e-06 ***
## log(shouts + 1) -0.0096372 0.0011346 -8.494 < 2e-16 ***
## log(tenure + 1) -0.0137723 0.0025563 -5.388 7.18e-08 ***
## good_country -0.0294245 0.0026313 -11.183 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.259 on 43815 degrees of freedom
## Multiple R-squared: 0.09357, Adjusted R-squared: 0.09334
## F-statistic: 411.2 on 11 and 43815 DF, p-value: < 2.2e-16
#odd ratio
exp(lm2$coefficients)
## (Intercept) log(age)
## 0.7664395 1.0641477
## male log(avg_friend_age + 1)
## 1.0229162 1.0255758
## log(subscriber_friend_cnt + 1) log(songsListened + 1)
## 1.0951596 1.0056350
## log(lovedTracks + 1) log(posts + 1)
## 1.0204940 1.0162385
## log(playlists + 1) log(shouts + 1)
## 1.0150833 0.9904091
## log(tenure + 1) good_country
## 0.9863221 0.9710042
Based on ANOVA, the model 2 is better for simplicity.
#ANOVA
anova(lm1, lm2, test="Chisq")
## Analysis of Variance Table
##
## Model 1: adopter ~ log(age) + male + log(friend_cnt + 1) + log(avg_friend_age +
## 1) + log(avg_friend_male + 1) + log(friend_country_cnt +
## 1) + log(subscriber_friend_cnt + 1) + log(songsListened +
## 1) + log(lovedTracks + 1) + log(posts + 1) + log(playlists +
## 1) + log(shouts + 1) + log(tenure + 1) + good_country
## Model 2: adopter ~ log(age) + male + log(avg_friend_age + 1) + log(subscriber_friend_cnt +
## 1) + log(songsListened + 1) + log(lovedTracks + 1) + log(posts +
## 1) + log(playlists + 1) + log(shouts + 1) + log(tenure +
## 1) + good_country
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 43812 2939.7
## 2 43815 2939.7 -3 -0.033112 0.9203
#multicollinearity
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
vif(lm2)
## log(age) male
## 2.299142 1.100530
## log(avg_friend_age + 1) log(subscriber_friend_cnt + 1)
## 2.234923 1.360421
## log(songsListened + 1) log(lovedTracks + 1)
## 1.488033 1.431314
## log(posts + 1) log(playlists + 1)
## 1.351067 1.109004
## log(shouts + 1) log(tenure + 1)
## 1.793807 1.372650
## good_country
## 1.031778
The AUC of ROC Curve is 0.791, which is great.
#ROC
library(pROC)
## Warning: package 'pROC' was built under R version 3.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
g = roc(df$adopter ~ predict(lm2, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(g, print.thres = "best" ,print.auc=TRUE)
#find the best threshold
best_thres = coords(g, "best", ret=c("threshold", "specificity", "sensitivity"))
## Warning in coords.roc(g, "best", ret = c("threshold", "specificity",
## "sensitivity")): The 'transpose' argument to FALSE by default since pROC
## 1.16. Set transpose = TRUE explicitly to revert to the previous behavior,
## or transpose = TRUE to silence this warning. Type help(coords_transpose)
## for additional information.
print(best_thres)
## threshold specificity sensitivity
## 1 0.09444048 0.6613896 0.7839524
#for 1 unit increase in log(age), adopter will increase by the factor of exp()
Interpretation: For every 1 unit increase in age, adopter will increase by a factor of 1.064. For every 1 unit increase in male, adopter will increase by a factor of 1.023. For every 1 unit increase in average friend’s age (avg_freind_age), adopter will increase by a factor of 1.026. For every 1 unit increase in subscriber friend count (subscriber_friend_cnt), adopter will increase by a factor of 1.095. For every 1 unit increase in songListened, adopter will increase by a factor of 1.006. For every 1 unit increase in loveTracked, adopter will increase by a factor of 1.020. For every 1 unit increase in posts, adopter will increase by a factor of 1.016. For every 1 unit increase in playlists, adopter will increase by a factor of 1.015. For every 1 unit increase in shouts, adopter will decrease by a factor of 1.020. For every 1 unit increase in tenure, adopter will decrease by a factor of 0.986. For every 1 unit increase in good country, adopter will decrease by a factor of 0.971.
To sum up, the age, male, friend average age, subscriber friend count, song listened, love tracked, posts, and playlists are the variables which has positive correlation with adopter. The variables of shouts, tenure, good country are negatively correlated with adopter, which means that these the higher three variables are, it is less likely to become premium users (adopter = 1).
#5. Free-to-Fee strategy for HighNote According to the descriptive statistics and the models we built in the previous sections, it shows that, for example, the demographic variables, such as male and age, are positive related to the adopter. If a user who is a male with elder age, then he might be the target users for HighNote. So, it might be a good marketing strategy for HighNote to segment their users and find the aftermentioned target users (elder men) to do more advertising or personalized marketing.
For the peer influence variables, if a user actively interacts with lots of subscriber friends whose age belongs to elder age, the user is more likely to continuously engage on HighNote and become the target premium users. So, the company could create more activities among users, such as community or word-of-mouth marketing, to increase peer influence and further convert free users to premium users.
Regarding the user engagement variables, the company could create more interactions between company and users, such as public relation campaigns about sharing listened songs, loved tracks or playlists. These might help engaged users and keep using on this platform. Eventually, they will become premium users.
Last but not the least, High Note should revamp several functions or services on platform, such as shouts or tenure, to attract users. Also, it is better for High Note to expand its oversea market, not just focused on US, UK, Germany since based on the model result, many premium users are from other countries outside US, UK, Germany and so forth. Therefore, it is a good chance to explore new market and reach out to the target users.